Method and system for transparent TCP offload with dynamic zero copy sending

ABSTRACT

Certain aspects of a method and system for transparent transmission control protocol (TCP) offload with dynamic zero copy sending are disclosed. Aspects of a method may include enabling direct memory access of data for a particular user application directly from a user buffer to a buffer in a single integrated circuit, for example, a network interface card (NIC), without copying the data to a kernel buffer, based on an occurrence of a number of memory page faults. At least one page in the user buffer comprising data for the particular user application to be transmitted may be marked as a copy-on-write to prevent modification of contents of the buffer before receipt of at least one acknowledgement packet.

CROSS-REFERENCE TO RELATED APPLICATIONS/INCORPORATION BY REFERENCE

This patent application makes reference to, claims priority to and claims benefit from U.S. Provisional Patent Application Ser. No. 60/700,544, filed on Jul. 18, 2005.

This application also makes reference to:

-   U.S. application Ser. No. ______ (Attorney Docket No. 17037US02) filed on even date herewith;
-   U.S. application Ser. No. ______ (Attorney Docket No. 17546US02) filed on even date herewith;
-   U.S. application Ser. No. ______ (Attorney Docket No. 17547US02) filed on even date herewith;
-   U.S. application Ser. No. ______ (Attorney Docket No. 17553US02) filed on even date herewith;
-   U.S. application Ser. No. ______ (Attorney Docket No. 17554US02) filed on even date herewith; and
-   U.S. application Ser. No. ______ (Attorney Docket No. 17556US02) filed on even date herewith.

Each of the above referenced applications is hereby incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

Certain embodiments of the invention relate to processing of TCP data and related TCP information. More specifically, certain embodiments of the invention relate to a method and system for transparent TCP offload with dynamic zero copy sending.

BACKGROUND OF THE INVENTION

There are different approaches for reducing the processing power required for TCP/IP stack processing. In a TCP Offload Engine (TOE), the offloading engine performs all or most of the TCP processing, presenting to the upper layer a stream of data. There may be various disadvantages to this approach. The TOE is tightly coupled with the operating system and therefore requires solutions that are dependent on the operating system and may require changes in the operating system to support it. The TOE may require a side by side stack solution, requiring some kind of manual configuration, either by the application, for example, by explicitly specifying a socket address family for accelerated connections, or by an IT administrator, for example, by explicitly specifying an IP subnet address for accelerated connections to select which of the TCP flows will be offloaded. In addition, the offload engine is very complex, as it needs to implement TCP packet processing.

Large segment offload (LSO)/transmit segment offload (TSO) may be utilized to reduce the required host processing power by reducing the transmit packet processing. In this approach, the host sends to the NIC transmit units that are larger than the maximum transmission unit (MTU), and the NIC cuts them into segments according to the MTU. Since part of the host processing is linear in the number of transmitted units, this reduces the required host processing power. While being efficient in reducing the transmit packet processing, LSO does not help with receive packet processing. In addition, for each single large transmit unit sent by the host, the host would receive from the far end multiple ACKs, one for each MTU-sized segment. The multiple ACKs require consumption of scarce and expensive bandwidth, thereby reducing throughput and efficiency.
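
As a rough illustration of the segmentation arithmetic described above, the following sketch splits one large transmit unit into MTU-sized segments. The function name, the 1500-byte MTU and the 40-byte header overhead (IPv4 and TCP headers without options) are assumptions chosen for the example and are not taken from this application.

```python
def segment_lengths(payload_len, mtu=1500, header_len=40):
    """Return the payload length of each segment cut from one large transmit unit.

    Assumes every segment carries at most (mtu - header_len) payload bytes.
    """
    max_payload = mtu - header_len
    if max_payload <= 0:
        raise ValueError("MTU must exceed the header length")
    full, rest = divmod(payload_len, max_payload)
    return [max_payload] * full + ([rest] if rest else [])


segments = segment_lengths(64 * 1024)
print(len(segments), segments[-1])  # 45 1296
```

Under these assumptions, a single 64 KB transmit unit becomes 45 segments on the wire, so the host handles one unit on transmit while the far end may return an ACK for each segment.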

In large receive offload (LRO), a stateless receive offload mechanism, the TCP flows may be split to multiple hardware queues, according to a hash function that guarantees that a specific TCP flow would always be directed into the same hardware queue. For each hardware queue, the mechanism takes advantage of interrupt coalescing to scan the queue and aggregate subsequent packets on the queue belonging to the same TCP flow into a single large receive unit.
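
The queue selection described above may be sketched as follows. The helper name and the use of CRC32 over the flow's 4-tuple are assumptions made for illustration; they stand in for whatever hash function a given device actually implements.

```python
import zlib


def queue_for_flow(src_ip, src_port, dst_ip, dst_port, num_queues):
    """Pick a hardware queue index for a TCP flow from its 4-tuple (hypothetical helper)."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode("ascii")
    # CRC32 is deterministic, so every packet of a flow maps to the same queue.
    return zlib.crc32(key) % num_queues


q1 = queue_for_flow("10.0.0.1", 5000, "10.0.0.2", 80, num_queues=4)
q2 = queue_for_flow("10.0.0.1", 5000, "10.0.0.2", 80, num_queues=4)
print(q1 == q2)  # True
```

Because the mapping depends only on the 4-tuple, distinct flows may still collide on one queue once the number of flows exceeds the number of queues.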

While this mechanism does not require any additional hardware from the NIC besides multiple hardware queues, it may have various performance limitations. For example, if the number of flows were larger than the number of hardware queues, multiple flows would fall into the same queue, resulting in no LRO aggregation for that queue. If the number of flows is larger than twice the number of hardware queues, no LRO aggregation is performed on any of the flows. The aggregation may be limited to the number of packets available to the host in one interrupt period. If the interrupt period is short, and the number of flows is not small, the number of packets that are available to the host CPU for aggregation on each flow may be small, resulting in limited or no LRO aggregation, even if the number of hardware queues is large. The LRO aggregation may be performed on the host CPU, resulting in additional processing. The driver may deliver to the TCP stack a linked list of buffers comprising a header buffer followed by a series of data buffers, which may require more processing than in the case where all the data is contiguously delivered on one buffer.

Accordingly, the computational power of the offload engine needs to be very high or at least the system needs a very large buffer to compensate for any additional delays due to the delayed processing of the out-of-order segments. When host memory is used for temporary storage of out-of-order segments, additional system memory bandwidth may be consumed when the previously out-of-order segments are copied to respective buffers. The additional copying provides a challenge for present memory subsystems, and as a result, these memory subsystems are unable to support high rates such as 10 Gbps.

In general, one challenge faced by TCP implementers wishing to design a flow-through NIC is that TCP segments may arrive out-of-order with respect to the order in which they were transmitted. This may prevent or otherwise hinder the immediate processing of the TCP control data and prevent the placing of the data in a user buffer. Accordingly, an implementer may be faced with the option of dropping out-of-order TCP segments or storing the TCP segments locally on the NIC until all the missing segments have been received. Once all the TCP segments have been received, they may be reordered and processed accordingly. In instances where the TCP segments are dropped or otherwise discarded, the sending side may have to re-transmit all the dropped TCP segments, which in some instances may result in about a fifty percent (50%) decrease in throughput or bandwidth utilization.

During conventional TCP processing, each of the plurality of TCP segments received would have to be individually processed by a host processor in the host system. TCP processing requires extensive CPU processing power in terms of both protocol processing and data placement on the receiver side. Current processing systems and methods involve the transfer of TCP state to a dedicated hardware such as a NIC, where significant changes to host TCP stack and/or underlying hardware are required.

When the host processor has to perform a read/write operation, a data buffer has to be allocated in the user space. A read operation may be utilized to copy data from the file into this allocated buffer. A write operation may be utilized to transmit the contents of the buffer to a network. The OS kernel has to copy all data from the user space into the kernel space. Copy operations are CPU and memory bandwidth intensive, limiting system performance.

The host processing power may be consumed by the copying of data between user space and kernel space in the TCP/IP stack. Some solutions have been proposed to reduce the host processing power. For example, utilizing remote direct memory access (RDMA) avoids memory copy in both transmit and receive directions. However, this requires a new application programming interface (API), a new wire protocol, and modifications to existing applications at both sides of the wire. A local DMA engine may be utilized to offload memory copy in both transmit and receive directions. Although a local DMA engine may offload copying operations from the CPU, it does not relieve the memory bandwidth required. The memory bandwidth may be a severe bottleneck in high speed networking applications as platforms are shifting to multiple CPU architectures, with multiple cores in each CPU architecture, all sharing the same memory.

Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with some aspects of the present invention as set forth in the remainder of the present application with reference to the drawings.

BRIEF SUMMARY OF THE INVENTION

A method and/or system for transparent TCP offload with dynamic zero copy sending, substantially as shown in and/or described in connection with at least one of the figures, as set forth more completely in the claims.

These and other advantages, aspects and novel features of the present invention, as well as details of an illustrated embodiment thereof, will be more fully understood from the following description and drawings.

BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS

FIG. 1A is a block diagram of an exemplary system for transparent TCP offload, in accordance with an embodiment of the invention.

FIG. 1B is a block diagram of another exemplary system for transparent TCP offload, in accordance with an embodiment of the invention.

FIG. 1C is an alternative embodiment of an exemplary system for transparent TCP offload, in accordance with an embodiment of the invention.

FIG. 1D is a block diagram of a system for handling transparent TCP offload, in accordance with an embodiment of the invention.

FIG. 2 is a flow diagram illustrating copying of data for a write operation in a host CPU system that may be utilized in connection with an embodiment of the invention.

FIG. 3 is a flow diagram illustrating transparent TCP offload with dynamic zero copy sending, in accordance with an embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION

Certain embodiments of the invention may be found in a method and system for transparent TCP offload with dynamic zero copy sending. Aspects of the method and system may comprise enabling direct memory access of data for a particular user application directly from a user buffer to a buffer in a single integrated circuit, for example, network interface card (NIC), without copying the data to a kernel buffer, based on an occurrence of a number of memory page faults. At least one page in the user buffer comprising data for the particular user application to be transmitted may be marked as a copy-on-write to prevent modification of contents of the buffer before receipt of at least one acknowledgement packet. After receiving an ACK, the copy-on-write attribute may be removed in order to let the application re-use the buffer without penalty.

FIG. 1A is a block diagram of an exemplary system for transparent TCP offload, in accordance with an embodiment of the invention. Accordingly, the system of FIG. 1A may be adapted to handle transparent TCP offload of transmission control protocol (TCP) datagrams or packets. Referring to FIG. 1A, the system may comprise, for example, a CPU 102, a memory controller 104, a host memory 106, a host interface 108, a network subsystem 110 and an Ethernet 112. The network subsystem 110 may comprise, for example, a transparent TCP-enabled Ethernet Controller (TTEEC) or a transparent TCP offload engine (TTOE) 114. The network subsystem 110 may comprise, for example, a network interface card (NIC). The host interface 108 may be, for example, a peripheral component interconnect (PCI), PCI-X, PCI-Express, ISA, SCSI or other type of bus. The memory controller 104 may be coupled to the CPU 102, to the host memory 106 and to the host interface 108. The host interface 108 may be coupled to the network subsystem 110 via the TTEEC/TTOE 114.

FIG. 1B is a block diagram of another exemplary system for transparent TCP offload, in accordance with an embodiment of the invention. Referring to FIG. 1B, the system may comprise, for example, a CPU 102, a host memory 106, a dedicated memory 116 and a chip set 118. The chip set 118 may comprise, for example, the single integrated circuit or network subsystem 110 and the memory controller 104. The chip set 118 may be coupled to the CPU 102, to the host memory 106, to the dedicated memory 116 and to the Ethernet 112. The network subsystem 110 of the chip set 118 may be coupled to the Ethernet 112. The network subsystem 110 may comprise, for example, the TTEEC/TTOE 114 that may be coupled to the Ethernet 112. The network subsystem 110 may communicate to the Ethernet 112 via a wired and/or a wireless connection, for example. The wireless connection may be a wireless local area network (WLAN) connection as supported by the IEEE 802.11 standards, for example. The network subsystem 110 may also comprise, for example, an on-chip memory 113. The dedicated memory 116 may provide buffers for context and/or data.

The network subsystem 110 may comprise a processor 111. Although illustrated, for example, as a CPU and an Ethernet, the present invention need not be so limited to such examples and may employ, for example, any type of processor and any type of data link layer or physical media, respectively. Accordingly, although illustrated as coupled to the Ethernet 112, the TTEEC or the TTOE 114 of FIG. 1A may be adapted for any type of data link layer or physical media. Furthermore, the present invention also contemplates different degrees of integration and separation between the components illustrated in FIGS. 1A-B. For example, the TTEEC/TTOE 114 may be a separate integrated chip from the chip set 118 embedded on a motherboard or may be embedded in a NIC. Similarly, the processor 111 may be a separate integrated chip from the chip set 118 embedded on a motherboard or may be embedded in a NIC. In addition, the dedicated memory 116 may be integrated with the chip set 118 or may be integrated with the network subsystem 110 of FIG. 1B.

FIG. 1C is an alternative embodiment of an exemplary system for transparent TCP offload, in accordance with an embodiment of the invention. Referring to FIG. 1C, there is shown a host processor 124, a host memory/buffer 126, a software algorithm block 134 and a NIC block 128. The NIC block 128 may comprise a NIC processor 130, a processor such as a coalescer 131 and a reduced NIC memory/buffer block 132. The NIC block 128 may communicate with an external network via a wired and/or a wireless connection, for example. The wireless connection may be a wireless local area network (WLAN) connection as supported by the IEEE 802.11 standards, for example. The processor 131 may be a dedicated processor or hardware state machine that may reside in the packet-receiving path. The host TCP stack may comprise software that enables management of the TCP protocol processing and may be part of an operating system, such as Microsoft Windows or Linux.

In accordance with certain embodiments of the invention, providing a single aggregated TCP segment to the host for TCP processing significantly reduces overhead processing by the host 124. Furthermore, since there is no transfer of TCP state information, dedicated hardware such as a NIC 128 may assist with the processing of received TCP segments by coalescing or aggregating multiple received TCP segments so as to reduce per-packet processing overhead.

In conventional TCP processing systems, it is necessary to know certain information about a TCP connection prior to arrival of a first segment for that TCP connection. In accordance with various embodiments of the invention, it is not necessary to know about the TCP connection prior to arrival of the first TCP segment since the TCP state or context information is still solely managed by the host TCP stack and there is no transfer of state information between the hardware stack and the software stack at any given time.

FIG. 1D is a block diagram of a system for handling transparent TCP offload, in accordance with an embodiment of the invention. Referring to FIG. 1D, there is shown an incoming packet frame 141, a frame parser 143, an association block 149, a context fetch block 151, a plurality of on-chip cache blocks 147, a plurality of off-chip storage blocks 160, a plurality of on-chip storage blocks 162, a RX processing block 150, a frame buffer 154, a DMA engine 163, a TCP code block 157, a host bus 165, and a plurality of host buffers 167. The RX processing block 150 may comprise a coalescer 152.

The frame parser 143 may comprise suitable logic, circuitry and/or code that may enable L2 Ethernet processing including, for example, address filtering, frame validity and error detection of the incoming frames 141. Unlike an ordinary Ethernet controller, the next stage of processing may comprise, for example, L3 such as IP processing and L4 such as TCP processing within the frame parser 143. The TTEEC 114 may reduce the host CPU 102 utilization and memory bandwidth, for example, by processing traffic on coalesced TCP/IP flows. The TTEEC 114 may detect, for example, the protocol to which incoming packets belong based on the packet parsing information and tuple 145. If the protocol is TCP, then the TTEEC 114 may detect whether the packet corresponds to an offloaded TCP flow, for example, a flow for which at least some TCP state information may be kept by the TTEEC 114. If the packet corresponds to an offloaded connection, then the TTEEC 114 may direct data movement of the data payload portion of the frame. The destination of the payload data may be determined from the flow state information in combination with direction information within the frame. The destination may be a host memory 106, for example. Finally, the TTEEC 114 may update its internal TCP and higher levels of flow state, without any coordination with the state of the connection on the host TCP stack, and may obtain the host buffer address and length from its internal flow state.

The receive system architecture may comprise, for example, a control path processing 140 and data movement engine 142. The system components above the control path, as illustrated in the upper portion of FIG. 1D, may be designed to deal with the various processing stages used to complete, for example, the L3/L4 or higher processing with maximal flexibility and efficiency and targeting wire speed. The result of the stages of processing may comprise, for example, one or more packet identification cards that may provide a control structure that may carry information associated with the frame payload data. This may have been generated inside the TTEEC 114 while processing the packet in the various blocks. A data path 142 may move the payload data portions or raw packets 155 of a frame along from, for example, an on-chip packet frame buffer 154 and, upon control processing completion, to a direct memory access (DMA) engine 163 and subsequently to the host buffer 167 via the host bus 165 that was chosen via processing. The data path 142 to the DMA engine may comprise packet data and optional headers 161.

The receiving system may perform, for example, one or more of the following: parsing the TCP/IP headers 145; associating the frame with a TCP/IP flow in the association block 149; fetching the TCP flow context in the context fetch block 151; processing the TCP/IP headers in the RX processing block 150; determining header/data boundaries and updating state 153; mapping the data to host buffers; and transferring the data via a DMA engine 163 into these host buffers 167. The headers may be consumed on chip or transferred to the host buffers 167 via the DMA engine 163.

The packet frame buffer 154 may be an optional block in the receive system architecture. It may be utilized for the same purpose as, for example, a first-in-first-out (FIFO) data structure is used in a conventional L2 NIC or for storing higher layer traffic for additional processing. The packet frame buffer 154 in the receive system may not be limited to a single instance. As control path 140 processing is performed, the data path 142 may store the data between data processing stages one or more times.

In an exemplary embodiment of the invention, at least a portion of the coalescing operations described for the coalescer 111 in FIG. 1B and/or for the coalescer 131 in FIG. 1C may be implemented in a coalescer 152 in the RX processing block 150 in FIG. 1D. In this instance, buffering or storage of TCP data may be performed by, for example, the frame buffer 154. Moreover, the flow lookup table (FLT) utilized by the coalescer 152 may be implemented using the off-chip storage 160 and/or the on-chip storage 162, for example.

In an embodiment of the invention, a new flow may be detected at some point during the flow lifetime. The flow state is unknown when the new flow is detected and the first packets are utilized to update the flow state until the flow is known to be in-order. A device performing TTO may also support other offload types, for example, TOE, RDMA, or iSCSI offload. In this case, the FLT for TTO may be shared with the connection search for other offload types with each entry in the FLT indicating the offload type for that flow. Packets that belong to flows of other offload types may not be candidates for TTO. Upon detecting a new flow, the flow may be initiated with the basic initialization context. An entry in the FLT with a flow ID may be created.

In another embodiment of the invention, a plurality of segments of the same flow may be aggregated in TTO up to a receive aggregation length (RAL), presenting to the host a bigger segment for processing. If aggregation is allowed, the received packet may be placed in the host memory 126 but will not be delivered to the host. Instead, the host processor 124 may update the context of the flow this packet belongs to. The new incoming packet may either be delivered immediately on its own, if there are no prior aggregated packets pending delivery, or be delivered as a single packet that represents both that packet and the previously received packets. In another embodiment of the invention, the packet may not be delivered but may update the flow's context.

A termination event may occur and the packet may not be aggregated if at least one of the following occurs at the TCP level: (1) the data is not in-order as derived from the received sequence number (SN) and the flow's context; (2) at least one packet with TCP flags other than ACK flag, for example, a PUSH flag is detected; (3) at least one packet with selective acknowledgement (SACK) information is detected; or (4) if the ACK SN received is bigger than the delivered ACK SN, and requires stopping the aggregation. Similarly, a termination event may occur and the packet may not be aggregated if at least one of the following occurs at the IP level: (1) the type of service (TOS) field in the IP header is different than the TOS field of the previous packets that were aggregated; or (2) the received packet is an IP fragment.
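
The termination conditions listed above may be expressed as a single predicate. The following is a minimal sketch; the class names, field names and the `ack_requires_stop` argument (which leaves the ACK SN decision of condition (4) to the caller) are hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass

TCP_ACK = 0x10  # ACK bit in the TCP flags field


@dataclass
class Packet:
    seq: int              # received sequence number (SN)
    flags: int            # TCP flags field
    has_sack: bool        # True if SACK information is present
    tos: int              # IP type of service
    is_ip_fragment: bool


@dataclass
class FlowContext:
    expected_seq: int  # next in-order SN for the flow
    tos: int           # TOS of the packets aggregated so far


def is_termination_event(pkt, ctx, ack_requires_stop=False):
    """Return True if the packet must not be aggregated."""
    tcp_level = (
        pkt.seq != ctx.expected_seq      # (1) data is not in order
        or (pkt.flags & ~TCP_ACK) != 0   # (2) flags other than ACK, e.g. PUSH
        or pkt.has_sack                  # (3) SACK information detected
        or ack_requires_stop             # (4) ACK SN rule, decided by the caller
    )
    ip_level = pkt.tos != ctx.tos or pkt.is_ip_fragment
    return tcp_level or ip_level


ctx = FlowContext(expected_seq=1000, tos=0)
print(is_termination_event(Packet(1000, TCP_ACK, False, 0, False), ctx))  # False
```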

When aggregating a plurality of packets to a single packet, the aggregated packet's header may contain the aggregated header of all the individual packets it contains. In an exemplary embodiment of the invention, a plurality of TCP rules for the aggregation may be as follows. For example, (1) the SN in the aggregated header is the SN of the first or oldest packet; (2) the ACK SN is the SN of the last or youngest segment; (3) the length of the aggregated header is the sum of the lengths of all the aggregated packets; (4) the window in the aggregated header is the window received in the last or youngest aggregated packet; (5) the time stamp (TS) in the aggregated header is the TS received in the first or oldest aggregated packet; (6) the TS-echo in the aggregated header is the TS-echo received in the first or oldest aggregated packet; and (7) the checksum in the aggregated header is the accumulated checksum of all aggregated packets.

In an exemplary embodiment of the invention, a plurality of IP field aggregation rules may be provided. For example, (1) the TOS of the aggregated header may be that of all the aggregated packets; (2) the time-to-live (TTL) of the aggregated header is the minimum of all incoming TTLs; (3) the length of the aggregated header is the sum of the lengths in the aggregated packets; (4) the fragment offset of the aggregated header may be zero for aggregated packets; and (5) the packet ID of the aggregated header is the last ID received.
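
The TCP and IP field rules above may be sketched together as one function. The dictionary layout and key names are assumptions made for the example. The sketch reads TCP rule (2) as taking the ACK SN carried by the last packet, and it omits the accumulated checksum of TCP rule (7) because the accumulation method is not specified here.

```python
def aggregate_headers(packets):
    """Build one aggregated header from in-order packets of the same flow.

    Each packet is a dict with the keys used below (illustrative layout).
    """
    if not packets:
        raise ValueError("need at least one packet")
    first, last = packets[0], packets[-1]
    return {
        # TCP fields
        "seq": first["seq"],          # SN of the first (oldest) packet
        "ack": last["ack"],           # ACK SN of the last (youngest) packet
        "window": last["window"],     # window of the last packet
        "ts": first["ts"],            # time stamp of the first packet
        "ts_echo": first["ts_echo"],  # TS-echo of the first packet
        # IP fields
        "tos": first["tos"],          # identical across aggregated packets
        "ttl": min(p["ttl"] for p in packets),
        "length": sum(p["length"] for p in packets),
        "frag_offset": 0,
        "ip_id": last["ip_id"],       # last ID received
    }


pkts = [
    {"seq": 1000, "ack": 50, "window": 8192, "ts": 7, "ts_echo": 3,
     "tos": 0, "ttl": 64, "length": 1460, "ip_id": 11},
    {"seq": 2460, "ack": 60, "window": 4096, "ts": 8, "ts_echo": 4,
     "tos": 0, "ttl": 63, "length": 1460, "ip_id": 12},
]
print(aggregate_headers(pkts)["length"])  # 2920
```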

The received packets may be aggregated until the received packet cannot be aggregated due to the occurrence of a termination event, or if a timeout has expired on that flow, or if the aggregated packet exceeds RAL. The timeout may be implemented by setting a timeout to a value, timeout aggregation value, when the first packet on a flow is placed without delivery. The following packets that are aggregated may not change the timeout. When the packets are delivered due to timeout expiration the timeout may be canceled and may be set again in the next first packet that is not delivered. Notwithstanding, other embodiments of the invention may provide timeout implementation by periodically scanning all the flows.
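
The per-flow timeout and RAL handling described above may be sketched as follows. The class and method names are hypothetical, time is an abstract integer supplied by the caller, and the sketch delivers the pending data just before a new packet would push it past the RAL, which is one possible reading of the RAL condition.

```python
class FlowAggregator:
    """Tracks undelivered bytes for one flow (illustrative sketch)."""

    def __init__(self, ral, timeout):
        self.ral = ral            # receive aggregation length, in bytes
        self.timeout = timeout    # timeout aggregation value
        self.pending = 0          # bytes placed but not yet delivered
        self.deadline = None      # set by the first undelivered packet only

    def _flush(self):
        delivered, self.pending, self.deadline = self.pending, 0, None
        return delivered

    def on_packet(self, now, length):
        """Place a packet; return the number of bytes delivered, or None."""
        delivered = None
        if self.deadline is not None and (
            now >= self.deadline or self.pending + length > self.ral
        ):
            delivered = self._flush()
        if self.deadline is None:
            # First packet placed without delivery arms the timeout.
            self.deadline = now + self.timeout
        self.pending += length    # later packets do not change the timeout
        return delivered

    def on_tick(self, now):
        """Deliver pending data if the timeout has expired."""
        if self.deadline is not None and now >= self.deadline:
            return self._flush()
        return None


agg = FlowAggregator(ral=4000, timeout=10)
agg.on_packet(0, 1460)
agg.on_packet(1, 1460)
print(agg.on_packet(2, 1460))  # 2920
```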

In an exemplary embodiment of the invention, the received ACK SN may be relevant to determine the rules to aggregate pure ACKs and to determine the rules to stop aggregation of packets with data due to the received ACK SN. The duplicated pure ACKs may never be aggregated. When duplicated pure ACKs are received, they may cause prior aggregated packets to be delivered and the pure ACK may be delivered immediately separately. The received ACK SN may also be utilized to stop the aggregation and deliver the pending aggregated packet to the host TCP/IP stack.

In an exemplary embodiment of the invention, a plurality of rules may be provided for stopping the aggregation according to the ACK SN. For example, the aggregation may be stopped if (1) the number of acknowledged (ACKed) bytes that are not yet delivered, taking into account the received segments and the prior segments that were not delivered, exceeds a threshold, for example, ReceiveAckedBytesAggregation, in bytes; or (2) the time from the arrival of the first packet that advanced the received ACK SN exceeds a threshold, for example, TimeoutAckAggregation. For this purpose, a second timer per flow may be required, or other mechanisms, such as periodically scanning the flows, may be implemented.
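
The two ACK-based rules above may be sketched as a predicate. The function and parameter names are hypothetical, and the two thresholds are passed in by the caller because no values are given here.

```python
def should_stop_for_ack(acked_undelivered_bytes, first_ack_advance_time, now,
                        max_acked_bytes, max_ack_delay):
    """Return True if the pending aggregated packet should be delivered.

    first_ack_advance_time is the arrival time of the first packet that
    advanced the received ACK SN, or None if no such packet is pending.
    """
    # Rule (1): too many ACKed bytes have not yet been delivered.
    if acked_undelivered_bytes > max_acked_bytes:
        return True
    # Rule (2): the ACK advance has been held back for too long.
    if (first_ack_advance_time is not None
            and now - first_ack_advance_time > max_ack_delay):
        return True
    return False


print(should_stop_for_ack(9000, None, 0, max_acked_bytes=8192, max_ack_delay=5))  # True
```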

In another exemplary embodiment of the invention, the flows may be removed from the host memory if one of the following occurs: (1) a reset (RST) flag was detected in the receive side; (2) a finish (FIN) flag was detected in the receive side; (3) there was no receive activity on the flow for a predefined time, for example, TerminateNoActivityTime; or (4) a KeepAlive packet in the receive direction was not acknowledged. A least recently used (LRU) cache may be used instead of a timeout rule to remove the flows from the host memory.

In another exemplary embodiment of the invention, the flows may be removed from the host memory if the flow was closed due to a retransmission timeout that requires information from the transmitter. In one exemplary embodiment of the invention, retransmission timeout may comprise periodically scanning all the flows to determine if any flow is closed. The period for scanning may be low, for example, 5 seconds. In each scan, if there is unacknowledged data that was transmitted by the NIC 128, the maximum transmitted sequence number (SN) may be recorded. Additionally, if there is unacknowledged data that was transmitted by the peer side, the maximum received SN may be recorded. If in two consecutive scans there is pending data on the same flow of the same type with the recorded number unchanged, pending data that was not acknowledged for the entire scan period may be indicated. In this case the flow may be removed.
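
The periodic scan described above may be sketched as follows. The function name and the dictionary layout are assumptions made for the example; `"tx"` and `"rx"` hold the maximum transmitted and received SN with unacknowledged data, or `None` when nothing is pending in that direction.

```python
def scan_flows(flows, recorded):
    """Run one periodic scan and return the IDs of flows to remove.

    flows:    flow_id -> {"tx": SN or None, "rx": SN or None}
    recorded: flow_id -> same layout, as saved by the previous scan;
              it is updated in place for the next scan.
    """
    to_remove = []
    for flow_id, state in flows.items():
        previous = recorded.get(flow_id, {})
        for kind in ("tx", "rx"):
            sn = state.get(kind)
            # Pending data of the same type with an unchanged SN across two
            # consecutive scans was not acknowledged for a whole scan period.
            if sn is not None and previous.get(kind) == sn:
                to_remove.append(flow_id)
                break
        recorded[flow_id] = dict(state)
    return to_remove


flows = {"a": {"tx": 100, "rx": None}, "b": {"tx": None, "rx": None}}
recorded = {}
print(scan_flows(flows, recorded))  # []
print(scan_flows(flows, recorded))  # ['a']
```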

FIG. 2 is a flow diagram illustrating copying of data for a write operation in a host CPU system, in accordance with an embodiment of the invention. Referring to FIG. 2, there is shown various context switches, a user buffer 264, a kernel buffer 262, a NIC memory 268, and copy operations to be performed for a write operation. In step 252, the write system call may cause a context switch from user mode to kernel mode. A copy may be performed to transfer the data into a kernel buffer 262. In step 254, the write system call may return, creating another context switch. Another copy may occur as the DMA engine may pass the data from the kernel buffer 262 to the NIC memory 268 independently and asynchronously. The data duplication may not be necessary and may be eliminated to decrease overhead and increase performance. To eliminate overhead, some of the copying between the kernel buffer 262 and user buffers 264 may be eliminated.

With the send zero-copy approach, the data may be read immediately from the disk into the OS cache memory using direct memory access (DMA) hardware. Data to be transferred may be taken directly from system buffers, without context switching, and without trashing the cache. Thus, the usage of send zero copy in server applications may significantly reduce CPU load.

The user buffer 264 may be posted to the NIC 128 for DMA transfer and the user pages may be marked with a copy on write (COW) attribute, so that if the user attempts writing to them, the kernel may trap these writes. The page may be copied to a different location and the new page may be remapped to the user application. Zero copy may be implemented not only on SendFile operations, but rather on send operations, by detecting the application usage pattern and selecting zero copy for applications that do not attempt writing to send buffers immediately.
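
One way to picture the dynamic selection described above is a per-application counter of copy-on-write page faults. The following sketch is an assumption-laden illustration: the class name, the method names and the fault threshold are hypothetical, and the policy of falling back to copying once the threshold is reached is only one possible interpretation.

```python
class ZeroCopyPolicy:
    """Per-application choice between zero copy and copying sends (sketch)."""

    def __init__(self, max_faults=3):
        self.max_faults = max_faults  # hypothetical threshold
        self.cow_faults = 0

    def record_cow_fault(self):
        # Called when the application wrote to a send page that was still
        # marked copy-on-write, forcing the kernel to copy and remap it.
        self.cow_faults += 1

    def use_zero_copy(self):
        # Applications that rarely write to in-flight send buffers keep
        # using zero copy; others fall back to the copying path.
        return self.cow_faults < self.max_faults


policy = ZeroCopyPolicy(max_faults=2)
print(policy.use_zero_copy())  # True
policy.record_cow_fault()
policy.record_cow_fault()
print(policy.use_zero_copy())  # False
```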

FIG. 3 is a flow diagram illustrating transparent TCP offload with dynamic zero copy sending, in accordance with an embodiment of the invention. Referring to FIG. 3, there is shown a user application 302, a socket interface 304, an INET stack 305, a NIC driver 310, a host system interface 312, a NIC 314, an Ethernet layer 316 and a remote peer 318. The INET stack 305 may comprise a system call 306 and a stack 308.

The user application 302 may transmit a message to the remote peer 318. The message may comprise a send message in case of a synchronous API or an asynchronous write message (AIO_write) with a user space description of the user buffer 126 (FIG. 1C). The system call 306 may pin the user buffer 126 to skip copying the user data to a kernel buffer and allow the NIC 314 to read the user data directly from the user buffer 126. In the case of a synchronous API, the system call 306 may prevent the user application 302 from modifying the contents of the user buffer 126 until the sent data is acknowledged (ACKed) by the remote peer 318. The system call 306 may enable marking at least one page in the user buffer 126 as copy-on-write (COW) to prevent the user application 302 from modifying the contents of the user buffer 126.

The system call 306 may enable generation of a buffer in the NIC 132 with sufficient room for the headers that may be added by the lower network layers, and the physical translation of the user application's 302 I/O vector. The system call 306 may enable registration of a destructor for the buffer in the NIC 132 so that it may unpin the pages when it may be time to free the user buffer 126. The system call 306 may further transmit the user buffer 126 to the stack 308 for processing. The stack 308 may pass the user buffer 126 to the NIC driver 310, which in turn transmits the data to the NIC 314 and the remote peer 318. The remote peer 318 may process the received TCP segment and transmit an ACK to the stack 308 for processing. When the data has been ACKed, the stack 308 may free the buffer in the NIC 132 by instructing the destructor to unpin the pages.
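By way of illustration only, the pin, destructor registration and free-on-ACK sequence described above may be sketched as follows. The names PinTable, NicBuffer, zero_copy_send and on_ack, as well as the 64-byte header room, are hypothetical and chosen only for this example; pages are represented by integer identifiers rather than by real memory.

```python
class PinTable:
    """Toy record of which user pages are currently pinned."""

    def __init__(self):
        self.pinned = set()

    def pin(self, pages):
        self.pinned.update(pages)

    def unpin(self, pages):
        self.pinned.difference_update(pages)


class NicBuffer:
    """Toy send buffer: room for headers plus references to user pages."""

    def __init__(self, header_room, pages):
        self.headers = bytearray(header_room)  # filled by lower layers
        self.pages = pages                     # translated I/O vector
        self._destructor = None

    def register_destructor(self, fn):
        self._destructor = fn

    def free(self):
        # Run the destructor at most once, even if free() is called again.
        if self._destructor is not None:
            fn, self._destructor = self._destructor, None
            fn(self)


def zero_copy_send(pin_table, page_ids, header_room=64):
    """Pin the user pages and build a buffer that unpins them when freed."""
    pin_table.pin(page_ids)
    buf = NicBuffer(header_room, list(page_ids))
    buf.register_destructor(lambda b: pin_table.unpin(b.pages))
    return buf


def on_ack(buf):
    """Called when the data has been ACKed: free the buffer."""
    buf.free()
```

Tying the unpin operation to the buffer's destructor, rather than to the send call, keeps the pages pinned for exactly as long as the data may still need to be retransmitted.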

In the case of a synchronous API, the stack 308 may instruct the destructor to unCOW the pages, that is, to disable the copy-on-write protection. The stack 308 may complete the zero copy operation. In the case of asynchronous I/O mode, the stack 308 may not be modified for zero copy send operation. The stack 308 may enable monitoring the number of page faults caused by copy-on-write (COW) protection for a particular user application 302. If the number of such page faults for a particular user application 302 exceeds a certain threshold, that particular user application 302 may not be chosen for zero copy operation and the host stack 308 may mark the particular user application 302 as zero copy disabled.
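By way of illustration only, the per-application fault monitoring described above may be sketched as follows. The class ZeroCopyPolicy, its method names and the use of a simple cumulative counter are assumptions made for the example; the description does not specify how faults are counted or over what interval.

```python
class ZeroCopyPolicy:
    """Toy per-application policy: disable zero copy after too many COW faults."""

    def __init__(self, fault_threshold: int):
        self.fault_threshold = fault_threshold
        self.faults = {}       # application id -> COW fault count
        self.disabled = set()  # applications marked zero copy disabled

    def record_cow_fault(self, app_id) -> None:
        """Count one COW page fault and disable zero copy past the threshold."""
        self.faults[app_id] = self.faults.get(app_id, 0) + 1
        if self.faults[app_id] > self.fault_threshold:
            self.disabled.add(app_id)

    def use_zero_copy(self, app_id) -> bool:
        """Return True if sends from this application may use zero copy."""
        return app_id not in self.disabled
```

In this sketch the send path would call use_zero_copy before choosing between the zero copy path and the conventional copy into a kernel buffer, and the COW fault handler would call record_cow_fault.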

A primary difference between a synchronous API mode and an asynchronous mode of operation is that, in the synchronous API mode, when a send message returns, the user application 302 may be free to immediately modify the user buffer 126. This does not imply that the user application 302 accesses the user buffer 126 immediately, but implies that the user application 302 may be allowed to access the user buffer 126, if necessary.

Some user applications 302 may access the user buffer 126 immediately after the send operation is completed. In these cases, there may be no gain in not copying the data and marking the page as COW, because when the user tries to access the page, a fault may be generated and the page may be copied, which costs more than copying would have cost in the first place. Therefore, the zero copy on sendmsg( ) in these cases may not be efficient. As a result, egress zero copy may be supported, for example, on the sendfile( ) API in an operating system.

In accordance with an embodiment of the invention, a method and system for transparent TCP offload with dynamic zero copy sending may reduce memory copy in the transmit direction and may include detecting user applications 302 that do not access the user buffer 126 immediately after the send operation. These user applications 302 may benefit from the dynamic zero copy because they may avoid COW page faults compared to other user applications that may access the user buffer 126 immediately after the send operation. In another embodiment of the invention, zero copy may be disabled for user applications whose number of COW page faults exceeds a certain threshold. The dynamic zero copy operation avoids the duplication of data between the user buffer 126 and the kernel buffer. In addition, there may be fewer context switches, less CPU data cache pollution and no CPU checksum calculations.

In another embodiment of the invention, a system for processing network information may comprise a host processor 124 (FIG. 1C) that enables direct memory access of data for a particular user application 302 directly from a user buffer 126 to a buffer in a network interface card (NIC) 132 without the data being copied to a kernel buffer based on an occurrence of a number of memory page faults. The host processor 124 may enable the direct memory access of data from the user buffer 126 to the buffer in the NIC 132 without data being copied to the kernel buffer, if the number of memory page faults is below a particular threshold. The host processor 124 may enable marking of at least one page in the user buffer comprising data for the particular user application 302 to be transmitted as a copy-on-write to prevent modification of contents of the user buffer 126 before receiving at least one acknowledgement packet. The host processor 124 may enable generation of the memory page fault, if the particular user application 302 accesses the user buffer 126 before receiving at least one acknowledgement packet.

The host processor 124 may enable pinning of the user buffer 126 to skip copying of data for the particular user application 302 to the kernel buffer. The host processor 124 may enable unpinning of at least one page in the user buffer 126 comprising data for the particular user application 302 after receiving at least one acknowledgement packet. The host processor 124 may enable unmarking of at least one page in the user buffer 126 comprising data for the particular user application 302 as a copy-on-write after receiving at least one acknowledgement packet. The host processor 124 may disable the copying of data directly from the user buffer 126 to the buffer in the NIC 132 without the data being copied to kernel buffer, if the number of memory page faults is above a particular threshold. The host processor 124 may enable registration of a destructor for the buffer in the NIC 132. The host processor 124 may enable deallocation of the buffer in the NIC 132 by accessing the destructor after receiving at least one acknowledgement packet.

Accordingly, the present invention may be realized in hardware, software, or a combination of hardware and software. The present invention may be realized in a centralized fashion in at least one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software may be a general-purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.

The present invention may also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.

While the present invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiment disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims. 

1. A method for processing network information, the method comprising: enabling direct memory access of data, based on an occurrence of a number of memory page faults, for a particular user application from a user buffer to a buffer in a single integrated circuit without said data being copied to a kernel buffer.
 2. The method according to claim 1, further comprising if said number of memory page faults is less than a particular threshold, enabling direct memory access of said data from said user buffer to said buffer in said single integrated circuit without said data being copied to said kernel buffer.
 3. The method according to claim 1, further comprising marking at least one page in said user buffer comprising said data for said particular user application to be transmitted as a copy-on-write to prevent modification of contents of said user buffer before receiving at least one acknowledgement packet.
 4. The method according to claim 3, further comprising generating at least one of said number of memory page faults, if said particular user application accesses said user buffer before receiving said at least one acknowledgement packet.
 5. The method according to claim 3, further comprising pinning said user buffer to skip copying of said data for said particular user application to said kernel buffer.
 6. The method according to claim 3, further comprising unpinning said at least one page in said user buffer comprising data for said particular user application after receiving said at least one acknowledgement packet.
 7. The method according to claim 3, further comprising unmarking said at least one page in said user buffer comprising data for said particular user application as copy-on-write after receiving said at least one acknowledgement packet.
 8. The method according to claim 1, further comprising if said number of said memory page faults is greater than a particular threshold, disabling said copying of said data directly from said user buffer to said buffer in said single integrated circuit without said data being copied to said kernel buffer.
 9. The method according to claim 1, further comprising registering a destructor for said buffer in said single integrated circuit.
 10. The method according to claim 9, further comprising deallocating said buffer in said single integrated circuit by accessing said destructor after receiving said at least one acknowledgement packet.
 11. A system for processing network information, the system comprising: a host processor that enables direct memory access of data, based on an occurrence of a number of memory page faults, for a particular user application directly from a user buffer to a buffer in a single integrated circuit without said data being copied to a kernel buffer.
 12. The system according to claim 11, wherein said host processor copies said data directly from said user buffer to said buffer in said single integrated circuit without said data being copied to said kernel buffer, if said number of memory page faults is less than a particular threshold.
 13. The system according to claim 11, wherein said host processor enables marking of at least one page in said user buffer comprising said data for said particular user application to be transmitted as a copy-on-write to prevent modification of contents of said marked at least one page in said user buffer before receiving at least one acknowledgement packet.
 14. The system according to claim 13, wherein said host processor enables generation of at least one of said memory page faults, if said particular user application accesses said user buffer before receiving said at least one acknowledgement packet.
 15. The system according to claim 13, wherein said host processor enables pinning of said user buffer to skip copying of said data for said particular user application to said kernel buffer.
 16. The system according to claim 13, wherein said host processor enables unpinning of said at least one page in said user buffer comprising data for said particular user application after receiving said at least one acknowledgement packet.
 17. The system according to claim 13, wherein said host processor enables unmarking of said at least one page in said user buffer comprising data for said particular user application as copy-on-write after receiving said at least one acknowledgement packet.
 18. The system according to claim 11, wherein said host processor disables said copying of said data directly from said user buffer to said buffer in said single integrated circuit without said data being copied to said kernel buffer, if said number of said memory page faults is greater than a particular threshold.
 19. The system according to claim 11, wherein said host processor enables registration of a destructor for said buffer in said single integrated circuit.
 20. The system according to claim 19, wherein said host processor enables deallocation of said buffer in said single integrated circuit by accessing said destructor after receiving said at least one acknowledgement packet.
 21. The system according to claim 11, wherein said single integrated circuit is a network interface card. 